HW-2: Tabular RL

A DSAN 6650 Homework

Authors

Kangheng Liu

Billy McGloin

Published

October 31, 2024


Introduction

We created a Python package named tabula that implements three games (environments) and three classical tabular reinforcement learning algorithms: dynamic programming, Monte Carlo control, and temporal-difference learning. Below, each solver is trained and evaluated on each environment.

import argparse
import pygame
import sys
import os
from IPython.display import display, Image
from tabula.environments import *  
from tabula.solvers import *  
from tabula.utils import Utils
pygame 2.6.1 (SDL 2.28.4, Python 3.11.9)
Hello from the pygame community. https://www.pygame.org/contribute.html

Environments & Solvers

Here we define a function that trains and visualizes a policy for a given environment and solver. It accepts assorted arguments, such as the number of episodes, the maximum steps per episode, and the output file paths. Because verbose is set to True, each environment-solver pair prints assorted training metrics, displays a convergence plot, renders the optimal policy, and shows a GIF of the optimal policy in action.

# set parameters (same for all solvers and envs)
verbose = True
save_metrics = True

# function to train and visualize policy for a given environment and solver
def train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename):
    # train agent
    policy = solver.train(max_steps=max_steps, episodes=episodes, verbose=verbose)

    # print policy
    print("\nOptimal Policy:")
    print(policy)

    # render optimal policy
    Utils.render_optimal_policy(
        env, policy, save_image=save_metrics, image_filename=image_filename
    )

    # run optimal policy
    Utils.run_optimal_policy(
        env, policy, save_gif=save_metrics, gif_filename=gif_filename
    )

    # plot convergence plot
    Utils.plot_convergence(solver.mean_reward, file_path=convergence_plot_filename)

Boat Environment

# set environment
env = BoatEnv()

Dynamic Programming

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_boat_dp.png")
gif_filename = os.path.join("./outputs", "gameplay_boat_dp.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_boat_dp.png")
# initialize solver
solver = DynamicProgramming(env)

# set parameters
episodes = 500
max_steps = 50

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Running simulation for 500 episodes...
Episode 1/500
Episode 51/500
Episode 101/500
Episode 151/500
Episode 201/500
Episode 251/500
Episode 301/500
Episode 351/500
Episode 401/500
Episode 451/500
Simulation complete.

Average reward during random simulation: 83.28
Transition Model (p(s', r | s, a)):
State 0, Action 0:
    Next State: 0, Reward: 0, Probability: 0.698
    Next State: 0, Reward: 1, Probability: 0.302
State 0, Action 1:
    Next State: 0, Reward: 1, Probability: 0.295
    Next State: 1, Reward: 2, Probability: 0.705
State 1, Action 0:
    Next State: 1, Reward: 0, Probability: 0.694
    Next State: 0, Reward: 2, Probability: 0.306
State 1, Action 1:
    Next State: 1, Reward: 4, Probability: 0.308
    Next State: 1, Reward: 3, Probability: 0.692

Starting value iteration...
Iteration 0: Mean Value = 2.506, Max Delta = 3.308
Value iteration converged after 78 iterations

Final State Values:
State 0: 30.889
State 1: 33.071

Optimal Policy:
[[0. 1.]
 [0. 1.]]
Optimal policy visualization saved as ./outputs/optim_policy_boat_dp.png
Gameplay GIF saved as ./outputs/gameplay_boat_dp.gif
Average reward following optimal policy: 164.60
Convergence plot saved as ./outputs/convergence_plot_boat_dp.png
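The printed transition model fully specifies a two-state MDP, so the value-iteration result can be checked by hand. The sketch below is a minimal stand-alone reimplementation, not the tabula solver itself; the discount factor of 0.9 is an assumption, chosen because it reproduces the printed state values.

```python
# Empirical model p(s', r | s, a) as printed above: {(s, a): [(s_next, reward, prob), ...]}
P = {
    (0, 0): [(0, 0, 0.698), (0, 1, 0.302)],
    (0, 1): [(0, 1, 0.295), (1, 2, 0.705)],
    (1, 0): [(1, 0, 0.694), (0, 2, 0.306)],
    (1, 1): [(1, 4, 0.308), (1, 3, 0.692)],
}

gamma = 0.9  # assumed discount; reproduces the printed state values

def q(s, a, V):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for s2, r, p in P[(s, a)])

V = [0.0, 0.0]
for _ in range(1000):
    V_new = [max(q(s, 0, V), q(s, 1, V)) for s in (0, 1)]
    if max(abs(x - y) for x, y in zip(V_new, V)) < 1e-8:
        V = V_new
        break
    V = V_new

print([round(v, 3) for v in V])  # ≈ [30.898, 33.08], close to the Final State Values above
policy = [max((0, 1), key=lambda a: q(s, a, V)) for s in (0, 1)]
print(policy)                    # [1, 1] — action 1 in both states, matching the policy above
```

Because the model here is estimated from 500 random episodes, the values agree with the solver's output only up to sampling error in the printed probabilities.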

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_boat_dp.gif

Monte Carlo

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_boat_mc.png")
gif_filename = os.path.join("./outputs", "gameplay_boat_mc.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_boat_mc.png")
# initialize solver
solver = MonteCarlo(env)

# set parameters
episodes = 100
max_steps = 50

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Starting Monte Carlo ES training for 100 episodes...
Episode 10/100 - Average Return: 108.90, Average Q-Value Update: 12.5477
Episode 20/100 - Average Return: 156.90, Average Q-Value Update: 2.8513
Episode 30/100 - Average Return: 154.60, Average Q-Value Update: 1.4027
Episode 40/100 - Average Return: 154.90, Average Q-Value Update: 1.0536
Episode 50/100 - Average Return: 155.70, Average Q-Value Update: 0.4959
Episode 60/100 - Average Return: 156.50, Average Q-Value Update: 0.4762
Episode 70/100 - Average Return: 152.60, Average Q-Value Update: 1.0448
Episode 80/100 - Average Return: 153.80, Average Q-Value Update: 0.3804
Episode 90/100 - Average Return: 158.10, Average Q-Value Update: 0.3425
Episode 100/100 - Average Return: 155.00, Average Q-Value Update: 0.2713

Action distribution across episodes: {0: '0.081', 1: '0.919'}
Final Average Return: 150.70
Final Average Q-Value Update: 2.0866
Final Action Values (Q):
 [[ 73.375      134.41791045]
 [ 98.68817204 152.13265306]]

Optimal Policy:
[[0. 1.]
 [0. 1.]]
Optimal policy visualization saved as ./outputs/optim_policy_boat_mc.png
Gameplay GIF saved as ./outputs/gameplay_boat_mc.gif
Average reward following optimal policy: 160.60
Convergence plot saved as ./outputs/convergence_plot_boat_mc.png
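The Monte Carlo ES result can also be sanity-checked against the empirical model printed in the DP section. The sketch below is a deliberately simplified variant, not the package's MonteCarlo solver: it uses exploring starts but averages only each episode's starting state-action return, and it assumes a discount of 0.9; the hyperparameters are illustrative.

```python
import random

# Empirical model from the DP section: {(s, a): [(s_next, reward, prob), ...]}
P = {
    (0, 0): [(0, 0, 0.698), (0, 1, 0.302)],
    (0, 1): [(0, 1, 0.295), (1, 2, 0.705)],
    (1, 0): [(1, 0, 0.694), (0, 2, 0.306)],
    (1, 1): [(1, 4, 0.308), (1, 3, 0.692)],
}

def step(s, a, rng):
    """Sample (next_state, reward) from the empirical model."""
    u, acc = rng.random(), 0.0
    for s2, r, p in P[(s, a)]:
        acc += p
        if u <= acc:
            return s2, r
    return P[(s, a)][-1][:2]

rng = random.Random(0)
gamma, episodes, horizon = 0.9, 2000, 50
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
N = {k: 0 for k in Q}

for _ in range(episodes):
    s0, a0 = rng.choice((0, 1)), rng.choice((0, 1))  # exploring start
    s, a, G, disc = s0, a0, 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, a, rng)
        G += disc * r
        disc *= gamma
        a = max((0, 1), key=lambda b: Q[(s, b)])      # greedy thereafter
    N[(s0, a0)] += 1
    Q[(s0, a0)] += (G - Q[(s0, a0)]) / N[(s0, a0)]    # incremental mean of returns

policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)]
print(policy)  # action 1 preferred in both states, matching the policy above
```

Because action 1 dominates action 0 under any continuation policy in this MDP, the greedy policy recovered by this sketch agrees with the solver's even though the Q estimates themselves differ in scale.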

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_boat_mc.gif

Temporal Difference

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_boat_td.png")
gif_filename = os.path.join("./outputs", "gameplay_boat_td.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_boat_td.png")
# initialize solver
solver = TemporalDifference(env)

# set parameters
episodes = 250
max_steps = 25

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Training Temporal Difference algorithm for 250 episodes...
Episode 1/250 - Average Return: 8.00, Average Q-Value Update: 0.0329
Episode 26/250 - Average Return: 69.04, Average Q-Value Update: 0.1394
Episode 51/250 - Average Return: 78.08, Average Q-Value Update: 0.0502
Episode 76/250 - Average Return: 76.32, Average Q-Value Update: 0.0598
Episode 101/250 - Average Return: 77.28, Average Q-Value Update: 0.0504
Episode 126/250 - Average Return: 77.92, Average Q-Value Update: 0.0564
Episode 151/250 - Average Return: 74.88, Average Q-Value Update: 0.0268
Episode 176/250 - Average Return: 77.32, Average Q-Value Update: 0.0537
Episode 201/250 - Average Return: 75.48, Average Q-Value Update: 0.0742
Episode 226/250 - Average Return: 76.60, Average Q-Value Update: 0.0855
Training complete! Action distribution across episodes: [0.06656 0.93344]
Final Action Values (Q):
 [[16.77819255 29.04173809]
 [27.88650034 31.35323269]]

Optimal Policy:
[[0. 1.]
 [0. 1.]]
Optimal policy visualization saved as ./outputs/optim_policy_boat_td.png
Gameplay GIF saved as ./outputs/gameplay_boat_td.gif
Average reward following optimal policy: 162.80
Convergence plot saved as ./outputs/convergence_plot_boat_td.png
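For the TD solver, a similar minimal check: the sketch below runs tabular Q-learning (one common TD-control method; whether TemporalDifference uses Q-learning or SARSA is not shown in the output) on the empirical model from the DP section. The discount of 0.9, step size, and exploration rate are assumptions for illustration.

```python
import random

# Empirical model from the DP section: {(s, a): [(s_next, reward, prob), ...]}
P = {
    (0, 0): [(0, 0, 0.698), (0, 1, 0.302)],
    (0, 1): [(0, 1, 0.295), (1, 2, 0.705)],
    (1, 0): [(1, 0, 0.694), (0, 2, 0.306)],
    (1, 1): [(1, 4, 0.308), (1, 3, 0.692)],
}

def step(s, a, rng):
    """Sample (next_state, reward) from the empirical model."""
    u, acc = rng.random(), 0.0
    for s2, r, p in P[(s, a)]:
        acc += p
        if u <= acc:
            return s2, r
    return P[(s, a)][-1][:2]

rng = random.Random(0)
gamma, alpha, eps = 0.9, 0.1, 0.1  # assumed hyperparameters
Q = [[0.0, 0.0], [0.0, 0.0]]

s = 0
for _ in range(20000):
    # epsilon-greedy behavior policy
    a = rng.choice((0, 1)) if rng.random() < eps else max((0, 1), key=lambda b: Q[s][b])
    s2, r = step(s, a, rng)
    # Q-learning update: bootstrap on the greedy next action
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    s = s2

policy = [max((0, 1), key=lambda a: Q[s][a]) for s in (0, 1)]
print(policy)  # action 1 preferred in both states, as in the policy above
```

With a constant step size the Q values keep fluctuating around their targets, but the gap between actions in this MDP is large enough that the greedy policy is stable.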

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_boat_td.gif

Grid World Environment

# set environment
env = GridWorldEnv()

Dynamic Programming

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_gridworld_dp.png")
gif_filename = os.path.join("./outputs", "gameplay_gridworld_dp.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_gridworld_dp.png")
# initialize solver
solver = DynamicProgramming(env)

# set parameters
episodes = 5000
max_steps = 50

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Note

The output from the above cell was removed to maintain brevity and avoid rendering issues.

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_gridworld_dp.gif

Monte Carlo

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_gridworld_mc.png")
gif_filename = os.path.join("./outputs", "gameplay_gridworld_mc.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_gridworld_mc.png")
# initialize solver
solver = MonteCarlo(env)

# set parameters
episodes = 20000
max_steps = 50

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Starting Monte Carlo ES training for 20000 episodes...
Episode 2000/20000 - Average Return: 48.34, Average Q-Value Update: 0.6868
Episode 4000/20000 - Average Return: 63.64, Average Q-Value Update: 0.0921
Episode 6000/20000 - Average Return: 65.19, Average Q-Value Update: 0.0493
Episode 8000/20000 - Average Return: 66.54, Average Q-Value Update: 0.0479
Episode 10000/20000 - Average Return: 65.71, Average Q-Value Update: 0.0292
Episode 12000/20000 - Average Return: 65.60, Average Q-Value Update: 0.0223
Episode 14000/20000 - Average Return: 66.22, Average Q-Value Update: 0.0205
Episode 16000/20000 - Average Return: 65.98, Average Q-Value Update: 0.0164
Episode 18000/20000 - Average Return: 63.51, Average Q-Value Update: 0.0135
Episode 20000/20000 - Average Return: 65.55, Average Q-Value Update: 0.0130

Action distribution across episodes: {0: '0.064', 1: '0.402', 2: '0.040', 3: '0.494'}
Final Average Return: 63.63
Final Average Q-Value Update: 0.0991
Final Action Values (Q):
 [[ 49.72759022  64.58013116  50.07287933  43.73318386]
 [ 48.35670732  61.73671875  40.08422301  57.26244726]
 [  0.           0.           0.           0.        ]
 [ 50.4         56.91       -27.2         55.3125    ]
 [-30.         -50.          60.63636364  58.92307692]
 [ 35.42857143  77.75531915  75.85714286  72.25      ]
 [ 53.77849462  57.98849252  56.01513388  66.54356979]
 [ 51.26218097  68.72222792  54.93872549  51.54096639]
 [  0.           0.           0.           0.        ]
 [ 54.5         64.73410966  59.95061728 -30.4084507 ]
 [  0.           0.           0.           0.        ]
 [ 73.37037037  82.63113839 -16.          69.7704918 ]
 [ 65.31811451  49.53723404  17.98263889  56.58579882]
 [ 49.78335373  65.61851332  60.73800259  71.12232525]
 [ 72.50140056  71.05419355  70.83906465  73.3743108 ]
 [ 61.77633478  74.39041096  73.0147651   75.33810036]
 [-26.3457189   69.27472527  67.57608696  81.21704606]
 [ 81.61102362  93.86234423  77.8903437   87.44444444]
 [ 59.64414414   7.03333333 -11.34848485  33.96969697]
 [ 65.92689531  52.11940299  48.27777778  59.39240506]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [ 91.55798319  97.61262261  95.75605816  96.43491577]
 [-31.         -45.         -16.42307692  48.22619048]
 [ 63.26415094   7.88888889  29.66666667  46.36363636]
 [  0.           0.           0.           0.        ]
 [ 51.13333333  40.83333333  65.625       94.92      ]
 [ 89.95348837  96.14666667  89.81967213  97.9680919 ]
 [ 96.78178368  99.41195799  96.35483871  97.62278978]
 [  0.           0.           0.           0.        ]
 [ 48.19230769  -6.          -5.66666667  16.75      ]
 [  0.           0.           0.           0.        ]
 [ 21.125       60.66666667 -21.          93.56666667]
 [ 76.91666667  97.11111111  95.28571429  98.96391753]
 [  0.           0.           0.           0.        ]]

Optimal Policy:
[[0.   1.   0.   0.  ]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   1.   0.   0.  ]
 [0.   0.   1.   0.  ]
 [0.   1.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   1.   0.   0.  ]
 [1.   0.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   1.   0.   0.  ]
 [1.   0.   0.   0.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25]
 [0.   1.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   1.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.25 0.25 0.25 0.25]]
Optimal policy visualization saved as ./outputs/optim_policy_gridworld_mc.png
Gameplay GIF saved as ./outputs/gameplay_gridworld_mc.gif
Average reward following optimal policy: 57.20
Convergence plot saved as ./outputs/convergence_plot_gridworld_mc.png

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_gridworld_mc.gif

Temporal Difference

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_gridworld_td.png")
gif_filename = os.path.join("./outputs", "gameplay_gridworld_td.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_gridworld_td.png")
# initialize solver
solver = TemporalDifference(env)

# set parameters
episodes = 5000
max_steps = 50

# train agent
train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Training Temporal Difference algorithm for 5000 episodes...
Episode 1/5000 - Average Return: -50.00, Average Q-Value Update: 0.0978
Episode 501/5000 - Average Return: 31.36, Average Q-Value Update: 0.8395
Episode 1001/5000 - Average Return: 66.72, Average Q-Value Update: 1.1583
Episode 1501/5000 - Average Return: 69.30, Average Q-Value Update: 0.5732
Episode 2001/5000 - Average Return: 66.96, Average Q-Value Update: 1.1327
Episode 2501/5000 - Average Return: 67.03, Average Q-Value Update: 1.1048
Episode 3001/5000 - Average Return: 69.84, Average Q-Value Update: 0.9666
Episode 3501/5000 - Average Return: 67.60, Average Q-Value Update: 0.4022
Episode 4001/5000 - Average Return: 70.12, Average Q-Value Update: 0.4230
Episode 4501/5000 - Average Return: 63.74, Average Q-Value Update: 0.6936
Training complete! Action distribution across episodes: [0.05683993 0.43818418 0.04423435 0.46074153]
Final Action Values (Q):
 [[  8.72341648   9.68369613   8.79054668   9.40910806]
 [ 10.63886285  12.78777383   8.60949284  10.50969795]
 [  0.           0.           0.           0.        ]
 [ -2.45880305  13.88297287  -3.02241517  -1.80531712]
 [ -5.8039351   -9.58953659  -7.22464597   3.22536122]
 [  3.42796067  27.51419617  -0.66155983   2.7051459 ]
 [  9.52690855  11.76514427  10.98764064  11.2225228 ]
 [ 11.06282968  14.54682443  11.04215323  12.83513281]
 [  0.           0.           0.           0.        ]
 [  8.14785989  18.5267381    9.27291087 -38.99120941]
 [  0.           0.           0.           0.        ]
 [  1.41159315  47.5884669  -29.15936646  20.93240955]
 [ 11.84751083   8.51742417  13.61910617  15.35608579]
 [ 13.56350326  13.74456391  13.58716744  20.66622038]
 [ 21.23891944  21.29802574  18.1315687   27.92089174]
 [ 21.51567131  23.26719417  22.73122029  31.72016518]
 [-28.77979039  30.39968769  24.76819846  33.2888013 ]
 [ 39.68317194  61.59173028  42.11074938  49.20087258]
 [ -5.20488254  -1.66227523  -2.38974999  11.30650922]
 [ 15.06879707   5.67718287   7.74945787  10.35020387]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [ 56.99041781  74.98418695  68.10120506  66.86630057]
 [-14.60166044 -12.87685618 -15.08364294  -0.44665007]
 [  9.96127095  -8.81427299  -7.87742456  -0.92478411]
 [  0.           0.           0.           0.        ]
 [  3.74158135   3.64915687  11.47980923  62.25886438]
 [ 34.38420928  31.87729863  45.20378749  77.95212932]
 [ 74.52270699  94.40718019  72.29216022  79.42938453]
 [  0.           0.           0.           0.        ]
 [ -5.96275722 -14.95046463 -23.42795    -14.65575544]
 [  0.           0.           0.           0.        ]
 [ -0.17809818   2.06812564  -0.17667356  27.34111762]
 [ 58.73746094   0.           2.49262456   5.5793068 ]
 [  0.           0.           0.           0.        ]]

Optimal Policy:
[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]
Optimal policy visualization saved as ./outputs/optim_policy_gridworld_td.png
Gameplay GIF saved as ./outputs/gameplay_gridworld_td.gif
Average reward following optimal policy: 57.40
Convergence plot saved as ./outputs/convergence_plot_gridworld_td.png

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_gridworld_td.gif

Geosearch Environment

# set environment
env = GeosearchEnv()

Dynamic Programming

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_geosearch_dp.png")
gif_filename = os.path.join("./outputs", "gameplay_geosearch_dp.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_geosearch_dp.png")
# initialize solver
solver = DynamicProgramming(env)

# set parameters
episodes = 1000
max_steps = 25

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Note

The output from the above cell was removed to maintain brevity and avoid rendering issues.

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_geosearch_dp.gif

Monte Carlo

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_geosearch_mc.png")
gif_filename = os.path.join("./outputs", "gameplay_geosearch_mc.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_geosearch_mc.png")
# initialize solver
solver = MonteCarlo(env)

# set parameters
episodes = 5000
max_steps = 50

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Starting Monte Carlo ES training for 5000 episodes...
Episode 500/5000 - Average Return: 0.21, Average Q-Value Update: 0.0178
Episode 1000/5000 - Average Return: 0.61, Average Q-Value Update: 0.0337
Episode 1500/5000 - Average Return: 0.74, Average Q-Value Update: 0.0257
Episode 2000/5000 - Average Return: 0.90, Average Q-Value Update: 0.0211
Episode 2500/5000 - Average Return: 0.83, Average Q-Value Update: 0.0111
Episode 3000/5000 - Average Return: 0.93, Average Q-Value Update: 0.0149
Episode 3500/5000 - Average Return: 0.93, Average Q-Value Update: 0.0134
Episode 4000/5000 - Average Return: 0.97, Average Q-Value Update: 0.0095
Episode 4500/5000 - Average Return: 1.09, Average Q-Value Update: 0.0119
Episode 5000/5000 - Average Return: 1.07, Average Q-Value Update: 0.0074

Action distribution across episodes: {0: '0.385', 1: '0.334', 2: '0.142', 3: '0.139'}
Final Average Return: 0.83
Final Average Q-Value Update: 0.0167
Final Action Values (Q):
 [[4.91973701e-44 4.29988245e-15 9.85861426e-54 2.05678110e-01]
 [1.74497194e-01 2.61032240e-01 1.92257673e-01 4.03753886e-01]
 [8.33469339e-02 1.14050510e-01 1.30846835e-01 3.58425156e-01]
 ...
 [3.07270104e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.48785796e+00 0.00000000e+00 0.00000000e+00 3.36811733e+00]
 [2.49377604e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]

Optimal Policy:
[[0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 ...
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]
Optimal policy visualization saved as ./outputs/optim_policy_geosearch_mc.png
Gameplay GIF saved as ./outputs/gameplay_geosearch_mc.gif
Average reward following optimal policy: 1.87
Convergence plot saved as ./outputs/convergence_plot_geosearch_mc.png

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_geosearch_mc.gif

Temporal Difference

# define output paths
image_filename = os.path.join("./outputs", "optim_policy_geosearch_td.png")
gif_filename = os.path.join("./outputs", "gameplay_geosearch_td.gif")
convergence_plot_filename = os.path.join("./outputs", "convergence_plot_geosearch_td.png")
# initialize solver
solver = TemporalDifference(env)

# set parameters
episodes = 5000
max_steps = 37

train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics, gif_filename, image_filename, convergence_plot_filename)
Training Temporal Difference algorithm for 5000 episodes...
Episode 1/5000 - Average Return: 0.00, Average Q-Value Update: 0.0000
Episode 501/5000 - Average Return: 0.30, Average Q-Value Update: 0.0008
Episode 1001/5000 - Average Return: 0.87, Average Q-Value Update: 0.0033
Episode 1501/5000 - Average Return: 0.96, Average Q-Value Update: 0.0019
Episode 2001/5000 - Average Return: 0.94, Average Q-Value Update: 0.0008
Episode 2501/5000 - Average Return: 0.94, Average Q-Value Update: 0.0008
Episode 3001/5000 - Average Return: 0.98, Average Q-Value Update: 0.0009
Episode 3501/5000 - Average Return: 0.95, Average Q-Value Update: 0.0007
Episode 4001/5000 - Average Return: 0.96, Average Q-Value Update: 0.0006
Episode 4501/5000 - Average Return: 0.98, Average Q-Value Update: 0.0004
Training complete! Action distribution across episodes: [0.33762162 0.31572432 0.17523784 0.17141622]
Final Action Values (Q):
 [[9.30551917e-60 5.07196190e-04 2.66878815e-60 1.64409324e-18]
 [9.61945092e-07 0.00000000e+00 1.59413480e-04 0.00000000e+00]
 [1.61243083e-38 2.69365526e-44 1.16581271e-05 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 7.36324124e-03 0.00000000e+00]
 [6.26682210e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.38993099e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00]]

Optimal Policy:
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]
Optimal policy visualization saved as ./outputs/optim_policy_geosearch_td.png
Gameplay GIF saved as ./outputs/gameplay_geosearch_td.gif
Average reward following optimal policy: 1.23
Convergence plot saved as ./outputs/convergence_plot_geosearch_td.png

# optimal policy
display(Image(filename=image_filename))

# gameplay showing optimal policy
display(Image(filename=gif_filename))
gameplay_geosearch_td.gif

Conclusion

As shown above, each solver converges to a sensible policy in each environment. Please check out the full code on our GitHub repository.